<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Quickstart

At its core, 🤗 Optimum uses _configuration objects_ to define parameters for optimization on different accelerators. These objects are then used to instantiate dedicated _optimizers_, _quantizers_, and _pruners_.

Before applying quantization or optimization, we first need to export our model to the ONNX format.

```python
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> from transformers import AutoTokenizer

>>> model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
>>> save_directory = "tmp/onnx/"
>>> # Load a model from transformers and export it to ONNX
>>> ort_model = ORTModelForSequenceClassification.from_pretrained(model_checkpoint, export=True)
>>> tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
>>> # Save the onnx model and tokenizer
>>> ort_model.save_pretrained(save_directory)
>>> tokenizer.save_pretrained(save_directory)  # doctest: +IGNORE_RESULT
```

Let's see now how we can apply dynamic quantization with ONNX Runtime:

```python
>>> from optimum.onnxruntime.configuration import AutoQuantizationConfig
>>> from optimum.onnxruntime import ORTQuantizer
>>> # Define the quantization methodology
>>> qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
>>> quantizer = ORTQuantizer.from_pretrained(ort_model)
>>> # Apply dynamic quantization on the model
>>> quantizer.quantize(save_dir=save_directory, quantization_config=qconfig)  # doctest: +IGNORE_RESULT
```

In this example, we've quantized a model from the Hugging Face Hub, but it could also be a path to a local model directory. The result from applying the `quantize()` method is a `model_quantized.onnx` file that can be used to run inference. Here's an example of how to load an ONNX Runtime model and generate predictions with it:

```python
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> from transformers import pipeline, AutoTokenizer
>>> model = ORTModelForSequenceClassification.from_pretrained(save_directory, file_name="model_quantized.onnx")
>>> tokenizer = AutoTokenizer.from_pretrained(save_directory)
>>> cls_pipeline = pipeline("text-classification", model=model, tokenizer=tokenizer)
>>> results = cls_pipeline("I love burritos!")
```

Similarly, you can apply static quantization by simply setting `is_static` to `True` when instantiating the `QuantizationConfig` object:

```python
>>> qconfig = AutoQuantizationConfig.arm64(is_static=True, per_channel=False)
```

Static quantization relies on feeding batches of data through the model to estimate the activation quantization parameters ahead of inference time. To support this, 🤗 Optimum allows you to provide a _calibration dataset_. The calibration dataset can be a simple `Dataset` object from the 🤗 Datasets library, or any dataset that's hosted on the Hugging Face Hub. For this example, we'll pick the [`sst2`](https://huggingface.co/datasets/glue/viewer/sst2/test) dataset that the model was originally trained on:

```python
>>> from functools import partial
>>> from optimum.onnxruntime.configuration import AutoCalibrationConfig

# Define the processing function to apply to each example after loading the dataset
>>> def preprocess_fn(ex, tokenizer):
...     return tokenizer(ex["sentence"])

>>> # Create the calibration dataset
>>> calibration_dataset = quantizer.get_calibration_dataset(
...     "glue",
...     dataset_config_name="sst2",
...     preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
...     num_samples=50,
...     dataset_split="train",
... )
>>> # Create the calibration configuration containing the parameters related to calibration.
>>> calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)
>>> # Perform the calibration step: computes the activations quantization ranges
>>> ranges = quantizer.fit(
...     dataset=calibration_dataset,
...     calibration_config=calibration_config,
...     operators_to_quantize=qconfig.operators_to_quantize,
... )
>>> # Apply static quantization on the model
>>> quantizer.quantize(    # doctest: +IGNORE_RESULT
...     save_dir=save_directory,
...     calibration_tensors_range=ranges,
...     quantization_config=qconfig,
... )
```

As a final example, let's take a look at applying _graph optimizations_ techniques such as operator fusion and constant folding. As before, we load a configuration object, but this time by setting the optimization level instead of the quantization approach:

```python
>>> from optimum.onnxruntime.configuration import OptimizationConfig

>>> # Here the optimization level is selected to be 1, enabling basic optimizations such as redundant node eliminations and constant folding. Higher optimization level will result in a hardware dependent optimized graph.
>>> optimization_config = OptimizationConfig(optimization_level=1)
```

Next, we load an _optimizer_ to apply these optimisations to our model:

```python
>>> from optimum.onnxruntime import ORTOptimizer

>>> optimizer = ORTOptimizer.from_pretrained(ort_model)

>>> # Optimize the model
>>> optimizer.optimize(save_dir=save_directory, optimization_config=optimization_config)  # doctest: +IGNORE_RESULT
```

And that's it - the model is now optimized and ready for inference! As you can see, the process is similar in each case:

1. Define the optimization / quantization strategies via an `OptimizationConfig` / `QuantizationConfig` object
2. Instantiate a `ORTQuantizer` or `ORTOptimizer` class
3. Apply the `quantize()` or `optimize()` method
4. Run inference

Check out the [`examples`](https://github.com/huggingface/optimum/tree/main/examples) directory for more sophisticated usage.

Happy optimising 🤗!
